Statistical analysis of GENEOnet robustness

Giovanni Bocchi

Dept. of Environmental Science and Policy, University of Milan

Alessandra Micheletti

Dept. of Environmental Science and Policy, University of Milan

2025-04-02

Outline

  1. Motivation
  2. GENEOnet and GENEOs
  3. Molecular Dynamics simulations
  4. Robustness analysis

Motivation

AI in biochemistry


AI is now pervasive in tackling complex biochemical challenges, such as:

  1. Predicting protein conformation
  2. Predicting protein-protein interactions
  3. Predicting protein-ligand interactions

However, the pace at which such AI systems have been developed has outstripped the development of explainable models or validations of their robustness.

Protein pocket detection


Protein pocket detection is a key problem in the context of drug discovery and design. It involves the identification of locations on a protein surface where small molecules (usually drugs) are likely to bind.

GENEOs and GENEOnet

GENEOnet


GENEOnet [1] is a specialized GENEO [2] network model designed for detecting protein pockets. It features a shallow architecture composed of a small number of GENEO units.

Notably, it is explainable by design. Compared with other state-of-the-art methods, it achieves better results despite its greater simplicity.

GENEOnet fits within the S.A.F.E. ML/AI framework [3].

GENEOs (informally)


GENEOnet was developed with GENEOs (Group Equivariant Non-Expansive Operators), mathematical tools that can be combined into network models featuring:

  1. Coherence with respect to geometrical transformations of the data
  2. Stability with respect to perturbations of the data

GENEOs


Fix two spaces of real-valued functions \(\Phi\), \(\Psi\) and two groups \(G\), \(H\) of transformations of their domains.

Definition 1: (GENEOs) A map \(F \colon \Phi \to \Psi\) is called a Group Equivariant Non-Expansive Operator if, for a fixed \(T\colon G\to H\), the following hold:

  1. \(F(\varphi \circ g) = F(\varphi) \circ T (g)\) for every \(\varphi \in \Phi\), \(g \in G\) (equivariance)

  2. \(||F(\varphi) - F(\varphi')||_{\infty} \le ||\varphi - \varphi'||_{\infty}\) for every \(\varphi, \varphi' \in \Phi\) (non-expansivity)
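
As a toy illustration of Definition 1 (not one of the operators used in GENEOnet), the sketch below assumes functions sampled on a cyclic grid of n points, takes \(G = H\) to be the group of cyclic shifts with \(T\) the identity, and lets \(F\) be a circular weighted average whose weights are non-negative and sum to 1. All function and variable names are hypothetical.

```python
import random

def shift(phi, s):
    """Compose phi with the cyclic shift g_s: i -> i + s (mod n)."""
    n = len(phi)
    return [phi[(i + s) % n] for i in range(n)]

def circular_average(phi, kernel):
    """Toy operator F: circular weighted average of phi.

    Assumes the kernel weights are non-negative and sum to 1,
    which is what makes F non-expansive in the sup norm.
    """
    n = len(phi)
    return [
        sum(w * phi[(i + k) % n] for k, w in enumerate(kernel))
        for i in range(n)
    ]

def sup_distance(phi, psi):
    """Sup-norm distance ||phi - psi||_inf between two sampled functions."""
    return max(abs(a - b) for a, b in zip(phi, psi))

random.seed(0)
n = 12
kernel = [0.25, 0.5, 0.25]  # hypothetical weights, sum to 1
phi = [random.random() for _ in range(n)]
phi_prime = [random.random() for _ in range(n)]

# Equivariance: F(phi o g) == F(phi) o T(g), here with T(g) = g.
equivariant = all(
    sup_distance(
        circular_average(shift(phi, s), kernel),
        shift(circular_average(phi, kernel), s),
    ) < 1e-12
    for s in range(n)
)

# Non-expansivity: ||F(phi) - F(phi')||_inf <= ||phi - phi'||_inf.
non_expansive = (
    sup_distance(circular_average(phi, kernel), circular_average(phi_prime, kernel))
    <= sup_distance(phi, phi_prime) + 1e-12
)

print(equivariant, non_expansive)
```

Both checks hold for any input here because averaging commutes with shifts and a convex combination cannot increase the largest pointwise difference.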

Molecular dynamics

Robustness analysis


We use Molecular Dynamics (MD) simulation data to assess GENEOnet's robustness to biologically relevant perturbations.

We retrieved MD simulation data from ATLAS [4], then:

  1. We selected the first \(T\) frames for each protein \(P\).
  2. For \(t=1,\dots,T\) we computed the GENEOnet global prediction \(G(P_t)\) (i.e. the union of all predicted pockets).
  3. For \(t=2,\dots,T\) we computed the Overlap and the RMSD between the current and the preceding frame: \[ \begin{aligned} O_t (P) &= \frac{|G(P_{t-1} ) \cap G(P_t )|}{ |G(P_{t-1} )|} \\ RMSD_t (P) &= \sqrt{\frac{1}{N} \sum_{j=1}^N||\mathbf{x}_{t-1}^j - \mathbf{x}_t^j ||^2} \end{aligned} \]
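
The two per-frame quantities above can be sketched as follows. This is a minimal illustration, assuming each global prediction \(G(P_t)\) is stored as a set of grid-cell indices and each frame as a list of 3D coordinates in matching order. The data representation, function names and toy values are assumptions, not the pipeline used in the study.

```python
import math

def overlap(prev_pred, curr_pred):
    """O_t = |G(P_{t-1}) & G(P_t)| / |G(P_{t-1})| for predictions stored as sets."""
    if not prev_pred:
        raise ValueError("previous prediction is empty; Overlap is undefined")
    return len(prev_pred & curr_pred) / len(prev_pred)

def rmsd(prev_coords, curr_coords):
    """Root-mean-square deviation between two frames with matching point order."""
    if len(prev_coords) != len(curr_coords):
        raise ValueError("frames must contain the same number of points")
    n = len(prev_coords)
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(prev_coords, curr_coords))
    return math.sqrt(squared / n)

# Toy data (hypothetical): predicted cells and coordinates for two frames.
pred_prev = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
pred_curr = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 0)}
coords_prev = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coords_curr = [(3.0, 4.0, 0.0), (1.0, 3.0, 4.0)]

o_t = overlap(pred_prev, pred_curr)     # 3 of the 4 previous cells are kept
rmsd_t = rmsd(coords_prev, coords_curr)  # each point moves by a distance of 5
print(o_t, rmsd_t)
```

Note that the Overlap is normalised by the size of the preceding prediction only, so it is not symmetric in the two frames.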

Evaluation


To assess GENEOnet's robustness [5], [6], we compared the distributions of Overlaps and RMSDs for the 37 proteins considered, expecting that:

  1. Small values of RMSD should imply high values of Overlap.
  2. High values of RMSD may, but need not, imply small values of Overlap, since larger movements between frames can affect areas where no pockets are predicted and thus have only a small effect on the Overlap.

Comparing the distributions

Boxplots of RMSD and Overlap

Take-home message



GENEOs can be used to develop explainable-by-design AI models that are also robust to perturbations in the data, as shown with GENEOnet and MD simulations.

Extended work




An extended version of this short work, featuring additional analyses and tests, has recently been published in Statistics [6].

Thank you for your attention!

Main references

[1]
G. Bocchi et al., “GENEOnet: A new machine learning paradigm based on Group Equivariant Non-Expansive Operators. An application to protein pocket detection,” 2022.
[2]
M. G. Bergomi, P. Frosini, D. Giorgi, and N. Quercioli, “Towards a topological-geometrical theory of group equivariant non-expansive operators for data analysis and machine learning,” Nature Machine Intelligence, vol. 1, no. 9, pp. 423–433, 2019, doi: 10.1038/s42256-019-0087-3.
[3]
P. Giudici, “Safe machine learning,” Statistics, vol. 58, no. 3, pp. 473–477, 2024, doi: 10.1080/02331888.2024.2361481.
[4]
Y. Vander Meersche, G. Cretin, A. Gheeraert, J.-C. Gelly, and T. Galochkina, “ATLAS: protein flexibility description from atomistic molecular dynamics simulations,” Nucleic Acids Research, vol. 52, no. D1, pp. D384–D392, 2023, doi: 10.1093/nar/gkad1084.
[5]
G. Bocchi et al., “A geometric XAI approach to protein pocket detection,” in Joint proceedings of the xAI 2024 late-breaking work, demos and doctoral consortium, CEUR-WS.org, 2024, pp. 217–224. Available: https://ceur-ws.org/Vol-3793/paper_28.pdf
[6]
G. Bocchi et al., “GENEOnet: Statistical analysis supporting explainability and trustworthiness,” Statistics, vol. 0, no. 0, pp. 1–26, 2025, doi: 10.1080/02331888.2025.2478203.